1.  Introduction

In  this  note  we  will  describe  a  simple  low-level  language  which  can  be  used 
to  write  ultracomputer  programs  of  the  synchronous  type,  i.e.  programs  oriented 
toward  the  coordinated  use  of  all  the  processors  of  an  ultracomputer. 

The variables of this language all have values which are vectors, of as many
components as the ultracomputer contains processors. We will regard such
vectors as being homogeneous, and consider their components as being bit strings
of declared length (or possibly word-length signed integers or reals). The special
constant ID is the vector whose components are 0 ... NPROC-1, the value n
being assumed in the n-th processor. A constant (e.g. 251) designates a vector
of size NPROC all of whose components have the indicated constant value.

Generally, but not always, we will assume that the processors of the
ultracomputer are operating in synchronous (or single-instruction/multiple-data)
mode, i.e. all executing the same instruction during each instruction cycle.
Thus, for example, in writing the instruction sequence

DO J = 1 TO 100;
    A(J) = B(J) + C(J) * D(J) + X;
END DO;

we may assume that:

(1) J designates a vector of NPROC integers, all of which are simultaneously
initialized, advanced, tested, etc., during execution of the loop shown above.
Likewise, X designates a vector of NPROC reals.

(2) A(J), B(J), C(J), and D(J) designate arrays of vectors, of some declared
extent DIM x NPROC (e.g., 100 x NPROC). All the separate additions and
multiplications appearing in the above loop (e.g., the NPROC multiplications
C(J) * D(J) for a fixed value of J) are performed synchronously; likewise
the NPROC stores indicated by A(J) = ... (for a fixed value of J).
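
As a rough illustration of this reading of the loop, the following Python sketch models each vector variable as a list with one entry per processor. The inner loop over p stands in for the NPROC operations that the hardware would perform in a single cycle. The sizes NPROC = 4 and DIM = 3 and the sample data are assumptions chosen only for the example.

```python
NPROC = 4   # assumed number of processors (illustrative only)
DIM = 3     # assumed declared extent of the arrays (illustrative only)

def synchronous_loop(B, C, D, X):
    """Model DO J = 1 TO DIM; A(J) = B(J) + C(J) * D(J) + X; END DO;"""
    A = [[0.0] * NPROC for _ in range(DIM)]
    for j in range(DIM):            # J advances identically in every processor
        for p in range(NPROC):      # stands in for NPROC simultaneous operations
            A[j][p] = B[j][p] + C[j][p] * D[j][p] + X[p]
    return A

B = [[1.0] * NPROC for _ in range(DIM)]
C = [[2.0] * NPROC for _ in range(DIM)]
D = [[float(p) for p in range(NPROC)] for _ in range(DIM)]
X = [10.0] * NPROC
A = synchronous_loop(B, C, D, X)
print(A[0])   # [11.0, 13.0, 15.0, 17.0]
```

Each row A[j] is one vector value: component p belongs to processor p, and the scalar-looking X contributes its own component X[p].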


The most important point here is that once synchronous operation is
established it will persist until disrupted by a conditional transfer involving
a (vector-valued) expression whose components may differ.

To be specific, we shall assume the syntax and semantic environment of
LITTLE (with details such as word size, real arithmetic format, signed integer
format, etc., defined by the particular machine used as the individual processor
of the ultracomputer).

Next  consider  an  IF  statement,  such  as 

IF ID >= NPROC/2 THEN
    J = J + 1;
ELSE
    J = J - 1;
END IF;

The  test  with  which  this  statement  begins  will  affect  certain  of  the  processors  one 
way  and  the  remainder  another,  thus  destroying  synchronization.  We  will  often 
wish  to  reestablish  synchronized  operation  at  the  end  of  such  a  conditional  block 
(and  sometimes  at  other  program  points  as  well).  For  this  purpose,  we  provide 
the  very  special  operation 

SYNCH;

This operation delays all processors which execute it until all processors have
arrived at the SYNCH operation, at which time they all proceed in resynchronized
fashion. (A plausibly efficient hardware implementation of this operation
will be suggested below.) Thus, we can reestablish synchronization after the
IF statement shown above by placing a 'SYNCH' immediately after the 'END IF'.
Because this construction will be rather common, we allow the combination

END  IF; 
SYNCH; 

to  be  abbreviated  as 

ENDS  IF; 

Similarly, we allow 'ENDS DO;' and 'ENDS WHILE;' to reestablish synchronized
operation immediately after a DO-loop or WHILE-loop which may have lost
synchronization. Moreover, we allow 'RETURNS' to reestablish synchronization
upon return from a subroutine or function.
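
The behavior of SYNCH is that of a barrier, and a minimal software analogue can be sketched with Python threads. This is an illustration only: the thread-per-processor model, the value NPROC = 4, and the particular test on the processor number are assumptions, and `threading.Barrier` merely plays the role of the hardware operation.

```python
import threading

NPROC = 4                           # assumed processor count (illustrative)
barrier = threading.Barrier(NPROC)  # stands in for the SYNCH hardware
J = [10] * NPROC                    # one component of J per processor
seen = [None] * NPROC               # what each processor observes after SYNCH

def processor(pid):
    # Divergent IF: the two branches may take different amounts of time.
    if pid >= NPROC // 2:
        J[pid] += 1
    else:
        J[pid] -= 1
    barrier.wait()                  # plays the role of SYNCH (or of ENDS IF)
    seen[pid] = list(J)             # every branch has finished by this point

threads = [threading.Thread(target=processor, args=(p,)) for p in range(NPROC)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(J)   # [9, 9, 11, 11]
```

Because no processor passes the barrier until all have reached it, every processor sees the completed updates of all the others afterwards.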

The compiler which translates our parallel language into ultraprocessor
machine code is allowed to replace any use of the SYNCH hardware operation by
several cycles of delaying no-ops if the precise number of cycles required to
traverse the different branches of a nest of IF statements can be precalculated.
(For example, in generating code for the IF statement shown above, the compiler
may know that both branches of the IF will take equally long to execute, so that
this particular IF statement preserves synchronization, even though other
IF statements may not. Having recognized this, the compiler can implement the
'ENDS' occurring at the end of this particular IF statement just as if it were
'END'. Note that all this suggests a kind of synchronization-related optimization
that can probably be handled by standard global analysis techniques.)
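
The padding decision described here can be sketched in a few lines of Python. The function names and the convention of using None for a branch whose cycle count cannot be precalculated are hypothetical, introduced only for this illustration.

```python
def needs_synch(branch_cycles):
    """True if some branch has an unknown cycle count (marked by None)."""
    return any(c is None for c in branch_cycles)

def noop_padding(branch_cycles):
    """No-ops to append to each branch so that all branches take equally long.

    Only meaningful when every cycle count is known in advance.
    """
    longest = max(branch_cycles)
    return [longest - c for c in branch_cycles]

print(noop_padding([3, 3]))     # [0, 0]  'ENDS' can be compiled as plain 'END'
print(noop_padding([5, 2]))     # [0, 3]  pad the shorter branch with 3 no-ops
print(needs_synch([5, None]))   # True    fall back to the SYNCH operation
```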

We use

LEFT I  and  RIGHT I

to denote the left shift and the right shift, respectively, of the
NPROC-dimensional vector value of I among processors. Similarly, we use

SHUF I  and  ISHUF I

to designate the shuffled and inverse-shuffled value of I. The component of I
held in the n-th processor will be written as I[n]; thus

(SHUF I)[n]  = I[if even(n) then n/2 else (n-1)/2 + NPROC/2]
(ISHUF I)[n] = I[if n < NPROC/2 then 2n else 2n - NPROC + 1].

The operations SHUF and ISHUF can be used even during periods of asynchronous
operation, though of course they will more ordinarily be used in procedures
whose expected effect depends on synchronous operation.
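
The two index formulas for SHUF and ISHUF can be checked with a short Python sketch in which a vector value is a list indexed by processor number. The function names `shuf` and `ishuf` are chosen for the illustration, and NPROC is assumed to be even.

```python
def shuf(I):
    """(SHUF I)[n] = I[n/2] for even n, I[(n-1)/2 + NPROC/2] for odd n."""
    nproc = len(I)
    return [I[n // 2] if n % 2 == 0 else I[n // 2 + nproc // 2]
            for n in range(nproc)]

def ishuf(I):
    """(ISHUF I)[n] = I[2n] for n < NPROC/2, I[2n - NPROC + 1] otherwise."""
    nproc = len(I)
    return [I[2 * n] if n < nproc // 2 else I[2 * n - nproc + 1]
            for n in range(nproc)]

ID = list(range(8))
print(shuf(ID))    # [0, 4, 1, 5, 2, 6, 3, 7]
print(ishuf(ID))   # [0, 2, 4, 6, 1, 3, 5, 7]
```

Applying one operation after the other returns the original vector, which confirms that the two formulas are mutually inverse.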


2.   Code  for  Various  Primitives 

We will now illustrate the flavor of the proposed programming language by
writing procedures which represent various of the important primitives reviewed
in [Schwartz, 79]. All of these procedures must be entered synchronously.

2.1.  Summing

Given X, this forms SUM(X), which is the vector of partial sums of X.

FUNC SUM (X);  $ PARTIAL SUMS OF X
    REAL X;  $ PARAMETER
    REAL Y;  $ SCRATCH QUANTITY
    SIZE J(INDEX_SIZE);  $ AUXILIARY INDEX
    $ THE SYSTEM QUANTITY 'LOGNPROC' IS THE LOGARITHM OF 'NPROC'.
    Y = X;
    IF LOWBIT ID = 1 THEN  $ ODD PROCESSOR
        Y = X + LEFT X;
    ENDS IF;
    DO J = 1 TO LOGNPROC - 1;
        Y = ISHUF Y;
        IF (LOWBIT ID = 1) & (HIGHBITS(J) (¬ID) = 0) THEN
            Y = Y + LEFT Y;
        ENDS IF;
    END DO;  $ SYNCHRONIZATION IS PRESERVED HERE
    DO J = 2 TO LOGNPROC;
        Y = SHUF Y;
        IF (LOWBIT ID = 0) & (ID > NPROC - 1 - LOWBITS(J) (NPROC - 1)) THEN
            Y = Y + LEFT Y;
        ENDS IF;
    END DO;  $ SYNCHRONIZATION IS PRESERVED HERE
    RETURN Y;
END FUNC SUM;

It  is  clear  that  partial  product,  maximum,  minimum,  etc.  routines  can  be 
written  to  the  same  pattern.  Note  also  that  the  Fast  Fourier  Transform  algorithm 
can  be  regarded  as  having  a  similar  pattern. 
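
To make the intended result concrete, the sketch below computes partial sums in a logarithmic number of synchronous steps. It deliberately uses a simpler shift-and-add scan (in the style usually attributed to Hillis and Steele) rather than the shuffle-based scheme of SUM, and it assumes that the partial sums are inclusive, i.e. Y[n] = X[0] + ... + X[n].

```python
from itertools import accumulate

def partial_sums(X):
    """Inclusive partial sums in about log2(NPROC) shift-and-add steps."""
    nproc = len(X)
    Y = list(X)
    d = 1
    while d < nproc:
        # Every processor n >= d adds the value held d places below it.
        # The comprehension reads only the old Y, as a synchronous step would.
        Y = [Y[n] + Y[n - d] if n >= d else Y[n] for n in range(nproc)]
        d *= 2
    return Y

X = [3, 1, 4, 1, 5, 9, 2, 6]
print(partial_sums(X))   # [3, 4, 8, 9, 14, 23, 25, 31]
```

Replacing the addition by multiplication, max, or min gives the other partial-result routines mentioned above.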

2.2.  Taking 

The procedure RTAKE(X, MARK) regards the components of the quantity
MARK as indicating the division of X into groups, a group beginning at each
component n such that MARK[n] ≠ 0. The value Y returned is defined by Y[n]
= X[m], where m is the largest integer not greater than n such that MARK[m] ≠
0. The following code shows the necessary processing pattern (which much
resembles that of 'summing'; cf. Section 2.1 above). Note that if no component
to the right of n (i.e., with smaller index than n) is marked, then
RTAKE(X, MARK)[n] = X[0].

FUNC RTAKE (X, MARK);
    REAL X;  $ PARAMETER
    REAL Y;  $ SCRATCH QUANTITY
    SIZE MARK (WORD_SIZE);  $ MARK PARAMETER
    SIZE MARCOP (WORD_SIZE);  $ INITIALLY THIS IS A COPY OF MARK, BUT IT WILL CHANGE
    SIZE J (INDEX_SIZE);  $ AUXILIARY INDEX
    Y = X;  MARCOP = MARK;
    IF (LOWBIT ID = 1) THEN
        IF MARCOP = 0 THEN
            Y = LEFT Y;
            MARCOP = MARCOP v LEFT MARCOP;
        END IF;
    ENDS IF;
    DO J = 1 TO LOGNPROC - 1;
        Y = ISHUF Y;
        MARCOP = ISHUF MARCOP;
        IF (LOWBIT ID = 1) & (HIGHBITS(J) (¬ID) = 0) THEN
            IF MARCOP = 0 THEN
                Y = LEFT Y;
                MARCOP = MARCOP v LEFT MARCOP;
            END IF MARCOP;
        ENDS IF;
    END DO;
    DO J = 2 TO LOGNPROC;
        Y = SHUF Y;
        MARCOP = SHUF MARCOP;
        IF (LOWBIT ID = 0) & (ID > NPROC - 1 - LOWBITS(J-1) (NPROC - 1)) THEN
            IF MARCOP = 0 THEN
                Y = LEFT Y;
                MARCOP = MARCOP v LEFT MARCOP;
            END IF MARCOP;
        ENDS IF;
    END DO;
    RETURN Y;
END FUNC RTAKE;

Various other useful operations can be expressed in terms of RTAKE. For
example, to transmit the value of X[0] to all other processors we have only to
write Y = RTAKE(X, ID=0). It is easy to program a companion LTAKE(X,
MARK) to RTAKE which works in much the same way but uses the first mark
to the left rather than to the right of a given place. Using this operator and the
operator MINS(X) (analogous to SUM), which calculates the 'partial minima'
(rather than partial sums) of the components of X, we can put the minimum of
all the components of X into every processor simply by writing Y = LTAKE
(MINS(X), ID = (NPROC - 1)). (As a matter of fact, it is easy to write a
procedure having this same effect which is almost twice as fast.)
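
The following Python sketch states what RTAKE, LTAKE, and MINS are meant to return, as sequential reference definitions rather than as the parallel shuffle-based procedures. The text gives no code for LTAKE or MINS, so both are written here from their prose descriptions. The fallback to the last component in `ltake` when no mark lies at or above n is an assumption made by symmetry with RTAKE.

```python
def rtake(X, MARK):
    """Y[n] = X[m], m the largest index <= n with MARK[m] != 0, else X[0]."""
    Y, current = [], X[0]
    for n in range(len(X)):
        if MARK[n] != 0:
            current = X[n]
        Y.append(current)
    return Y

def ltake(X, MARK):
    """Mirror image: m is the smallest index >= n with MARK[m] != 0.

    Falling back to X[NPROC-1] when there is no such m is an assumption.
    """
    return rtake(X[::-1], MARK[::-1])[::-1]

def mins(X):
    """Partial minima: Y[n] = min(X[0], ..., X[n])."""
    out, best = [], X[0]
    for x in X:
        best = min(best, x)
        out.append(best)
    return out

X = [7, 3, 8, 2, 9, 4, 6, 5]
ID = list(range(len(X)))
print(rtake(X, [1, 0, 0, 1, 0, 1, 0, 0]))     # [7, 7, 7, 2, 2, 4, 4, 4]
print(rtake(X, [int(i == 0) for i in ID]))    # [7, 7, 7, 7, 7, 7, 7, 7]
print(ltake(mins(X), [int(i == len(X) - 1) for i in ID]))   # [2, 2, 2, 2, 2, 2, 2, 2]
```

The last two lines correspond to Y = RTAKE(X, ID=0) and Y = LTAKE(MINS(X), ID = (NPROC - 1)).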

2.3.  Data permutation

As noted in [Schw, 79], an arbitrary permutation f of NPROC items can be
factored into a product of 2D-1 maps g_j interleaved with shuffles and inverse
shuffles, in the order in which the function PERM below applies them. Here
D = LOGNPROC, σ designates the shuffle map, and each of the permutations g_j
interchanges some of the collection of all possible even-odd pairs {2n, 2n+1}
of integers. Accordingly, each map g_j is described by a vector quantity G_j
of dimension NPROC with single-bit components, and all the maps G_j together
define a vector quantity G with bitstring components. (We assume that G[2n] =
G[2n+1] for all n.) Once the quantity G has been calculated, the permutation
which it describes can be effected by calling the following rather simple function.

FUNC PERM (X, G);
    REAL X;  $ PARAMETER
    REAL Y, Z;  $ SCRATCH QUANTITIES
    SIZE J (INDEX_SIZE);  $ AUXILIARY INDEX
    SIZE ISODD (1);  $ NONZERO IF ODD PROCESSOR
    ISODD = LOWBIT ID;
    Y = X;
    IF (BIT(1) G) & ISODD THEN
        Y = LEFT X;
    ELSEIF BIT(1) G THEN
        Y = RIGHT X;
    ENDS IF;
    DO J = 2 TO LOGNPROC;  $ SHUFFLE, THEN EXCHANGE
        Z = SHUF Y;  Y = Z;
        IF (BIT(J) G) & ISODD THEN
            Y = LEFT Z;
        ELSEIF (BIT(J) G) THEN
            Y = RIGHT Z;
        ENDS IF;
    END DO;
    DO J = LOGNPROC + 1 TO 2*LOGNPROC - 1;  $ INVERSE SHUFFLE, THEN EXCHANGE
        Z = ISHUF Y;  Y = Z;
        IF (BIT(J) G) & ISODD THEN
            Y = LEFT Z;
        ELSEIF (BIT(J) G) THEN
            Y = RIGHT Z;
        ENDS IF;
    END DO;
    RETURN Y;
END FUNC PERM;
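
A single exchange stage of PERM can be sketched in Python as follows. The function name `exchange_stage` is hypothetical. The argument Gj holds one bit per processor, with Gj[2n] = Gj[2n+1] as assumed in the text. In the LITTLE code the same effect is obtained by letting the odd processor of a pair take LEFT and the even processor take RIGHT.

```python
def exchange_stage(Y, Gj):
    """Apply one map g_j: swap Y[2n] and Y[2n+1] wherever Gj[2n] is set."""
    out = list(Y)
    for n in range(0, len(Y), 2):
        if Gj[n]:                       # Gj[n] == Gj[n + 1] is assumed
            out[n], out[n + 1] = Y[n + 1], Y[n]
    return out

print(exchange_stage(["a", "b", "c", "d"], [1, 1, 0, 0]))   # ['b', 'a', 'c', 'd']
```

Computing the bit vectors G_j for a given permutation f is a separate problem that this sketch does not address.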

2.4.  Bitonic murging and sorting

The following codes represent the bitonic murging and sorting operations, as
described in [Schw, 79]. Note that MURGE has a structure closely related to
that of PERM; cf. Section 2.3 above.

FUNC MURGE (X);  $ PERFORMS BITONIC MURGING;
                 $ X[n] IS ASSUMED TO RISE TILL
                 $ n = NPROC/2 - 1, THEN FALL
    REAL X;  $ PARAMETER
    REAL Y, Z;  $ SCRATCH QUANTITIES
    SIZE J (INDEX_SIZE);  $ AUXILIARY INDEX
    SIZE ISODD(1);  $ NONZERO IF ODD PROCESSOR
    ISODD = LOWBIT ID;
    Y = X;
    DO J = 1 TO LOGNPROC;
        Z = SHUF Y;  Y = Z;
        IF ISODD & Z < LEFT Z THEN
            Y = LEFT Z;
        ELSEIF ¬ISODD & Z > RIGHT Z THEN
            Y = RIGHT Z;
        ENDS IF;
    END DO;
    RETURN Y;
END FUNC MURGE;
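
The MURGE loop can be traced with a direct Python transliteration. The sketch assumes that NPROC is a power of two and that (LEFT Z)[n] = Z[n-1] and (RIGHT Z)[n] = Z[n+1]. That reading of LEFT and RIGHT is inferred from how the procedures pair each odd processor with the even processor just below it.

```python
def shuf(I):
    """(SHUF I)[n] = I[n/2] for even n, I[(n-1)/2 + NPROC/2] for odd n."""
    nproc = len(I)
    return [I[n // 2] if n % 2 == 0 else I[n // 2 + nproc // 2]
            for n in range(nproc)]

def murge(X):
    """Sort a sequence that rises over its first half and falls over its second."""
    nproc = len(X)
    lognproc = nproc.bit_length() - 1   # assumes NPROC is a power of two
    Y = list(X)
    for _ in range(lognproc):
        Z = shuf(Y)
        Y = list(Z)
        for n in range(nproc):          # all processors act in the same step
            if n % 2 == 1 and Z[n] < Z[n - 1]:      # odd: keep the larger value
                Y[n] = Z[n - 1]
            elif n % 2 == 0 and Z[n] > Z[n + 1]:    # even: keep the smaller value
                Y[n] = Z[n + 1]
    return Y

print(murge([1, 3, 5, 7, 8, 6, 4, 2]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

After each shuffle, every even-odd pair performs a compare-exchange, and LOGNPROC such rounds leave the sequence in ascending order.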

Bitonic sorting is a somewhat more complex operation. We can regard it as
a merging sort based on the MURGE operation described by the preceding
program (see [Schw, 79]; [Knuth, v. II]). However, the implementation we desire
is iterative rather than recursive; also we will keep reversing the signs of the
quantities being merged in order to maintain the rising/falling pattern expected by
the murge suboperations. The code is as follows:

FUNC BITON (X);
    REAL X;  $ PARAMETER TO BE SORTED
    REAL Y, Z;  $ SCRATCH QUANTITIES
    SIZE J(INDEX_SIZE);  $ AUXILIARY INDEX
    SIZE ISODD(1);  $ NONZERO IF ODD PROCESSOR
    ISODD = LOWBIT ID;
    Y = X;
    IF (BIT(1) ID) THEN  $ REVERSE, TO PRECONDITION
        Y = -Y;
    ENDS IF;
    DO J = 1 TO LOGNPROC/2;
        IF (BIT(J) ID ≠ BIT(J + 1) ID) THEN
            Y = -Y;  $ MUST REVERSE AGAIN FOR THIS CYCLE
        ENDS IF;
        DO K = 1 TO J;  $ ISHUFFLE REPEATEDLY TO PUT
            Y = ISHUF Y;  $ INTO POSITION FOR MURGING
        END DO;
        DO K = 1 TO J;  $ THIS NOW RESEMBLES
            Z = SHUF Y;  Y = Z;  $ THE MURGE LOOP
            IF ISODD & Z < LEFT Z THEN
                Y = LEFT Z;
            ELSEIF ¬ISODD & Z > RIGHT Z THEN
                Y = RIGHT Z;
            ENDS IF;
        END DO K;
    END DO J;

    $ NOW WE BEGIN THE SECOND HALF OF THE PROCESSING,
    $ WHICH ONLY DIFFERS FROM THE FIRST HALF TO SAVE
    $ UNNECESSARY ISHUFFLES WHENEVER THE SAME EFFECT CAN
    $ BE OBTAINED BY A COMPLEMENTARY NUMBER OF SHUFFLES.
    DO J = LOGNPROC/2 + 1 TO LOGNPROC;
        IF (BIT(J) ID ≠ BIT(J + 1) ID) THEN
            Y = -Y;  $ MUST REVERSE AGAIN FOR THIS CYCLE
        ENDS IF;
        DO K = 1 TO LOGNPROC - J;  $ NOW IT IS MORE EFFICIENT
            Y = SHUF Y;  $ TO SHUFFLE REPEATEDLY
        END DO;
        DO K = 1 TO J;  $ AGAIN THIS RESEMBLES
            Z = SHUF Y;  Y = Z;  $ THE MURGE LOOP
            IF ISODD & Z < LEFT Z THEN
                Y = LEFT Z;
            ELSEIF ¬ISODD & Z > RIGHT Z THEN
                Y = RIGHT Z;
            ENDS IF;
        END DO K;
    END DO J;
    RETURN Y;
END FUNC BITON;

As  noted  in  [Schw,  79]  a  wide  variety  of  set-theoretic,  communication,  and 
other  functions  can  be  expressed  quite  directly  in  terms  of  BITON. 
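
For comparison, the sketch below gives the familiar textbook form of iterative bitonic sorting as a compare-exchange network over index pairs. It is not a transliteration of BITON: it addresses partners directly by index instead of shuffling, and it alternates sort directions instead of reversing signs. It assumes that the number of items is a power of two.

```python
def bitonic_sort(X):
    """Iterative bitonic sort (textbook compare-exchange network)."""
    n = len(X)                      # assumed to be a power of two
    Y = list(X)
    k = 2
    while k <= n:                   # size of the bitonic runs being merged
        j = k // 2
        while j >= 1:               # distance between compared partners
            for i in range(n):      # conceptually, all i act in one step
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (Y[i] > Y[partner]) == ascending:
                        Y[i], Y[partner] = Y[partner], Y[i]
            j //= 2
        k *= 2
    return Y

print(bitonic_sort([3, 1, 4, 2]))   # [1, 2, 3, 4]
```

The outer loop plays the part of the loop over J in BITON, and the inner loop over j corresponds to the murge passes performed for each J.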


3.   A  Software  Emulation  Technique 

A reasonably efficient and acceptably accurate software emulation for
programs written in the language that we have described can be constructed by
modifying the LITTLE compiler. We can proceed as follows: all operations except
three of those involving interprocessor communication (namely SYNCH, SHUF,
and ISHUF) are compiled in essentially the ordinary manner, except that only a
limited number of registers (e.g., X1, X6, A1, A6, B1, B6 on the 6600) are used.
Arrays declared to be of dimension N are treated as if they were of dimension
N*NPROC; simple variables are treated as quantities of dimension NPROC. Status
information for all the available processors (consisting of register contents and
instruction location counter) is stored in an array of appropriate size. The ID
of the currently emulated processor is held in some specially allocated register.

When emulation switches to a given processor, the register contents for that
processor are loaded and a jump to the current instruction location for that
processor is taken. Execution then continues in standard fashion until an operation
requiring synchronization is encountered. Preceding any such operation, we
compile a call to a special 'advance processor' procedure which stores the current
register contents and ILC (instruction location counter), and advances to
emulation of the next processor. When all processors have reached the
synchronization point we let them proceed (in the case of a SYNCH) or we execute
a shuffle or an inverse shuffle (in the SHUF and ISHUF cases). Then emulation
begins again with the first processor.

Note that this approach avoids the overhead of interpretation. It comes close
to emulating, but does not quite capture all the details of, the hardware
synchronization semantics described earlier. In particular, LEFT and RIGHT may
not operate quite as synchronously as desired, whereas SHUF and ISHUF can force
synchronization in a more restrictive manner than hardware is likely to. (But
differences should only be seen in certain esoteric, even if conceivable, cases. In
particular, all the parallel codes listed in the preceding pages will execute without
difficulty.)

Deadlock will occur if two different processors attempt to execute operations
requiring  synchronization  at  two  different  points.  Such  deadlock  can  easily  be 
detected  and  diagnosed  when  it  does  occur. 
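
The scheduling scheme of this section can be sketched in Python with generators standing in for the saved register contents and instruction location counters. Everything here is illustrative: the request format ("SYNCH" | "SHUF" | "ISHUF", value), the function names, and the sample programs are assumptions. The deadlock check compares only the kind of operation, not the program point at which it occurs, so it is weaker than the condition stated above.

```python
def shuf(v):
    n = len(v)
    return [v[i // 2] if i % 2 == 0 else v[i // 2 + n // 2] for i in range(n)]

def ishuf(v):
    n = len(v)
    return [v[2 * i] if i < n // 2 else v[2 * i - n + 1] for i in range(n)]

def emulate(program, nproc):
    """Run program(pid, nproc) for every processor, one processor at a time."""
    procs = [program(pid, nproc) for pid in range(nproc)]
    replies = [None] * nproc
    while True:
        requests = []
        for pid, proc in enumerate(procs):      # 'advance processor' in turn
            try:
                requests.append(proc.send(replies[pid]))
            except StopIteration as stop:       # this processor has returned
                requests.append(("DONE", stop.value))
        kinds = {kind for kind, _ in requests}
        if len(kinds) > 1:
            raise RuntimeError("deadlock: processors wait at "
                               + ", ".join(sorted(kinds)))
        kind = kinds.pop()
        values = [value for _, value in requests]
        if kind == "DONE":
            return values
        if kind == "SHUF":
            replies = shuf(values)
        elif kind == "ISHUF":
            replies = ishuf(values)
        else:                                   # plain SYNCH: nothing to exchange
            replies = [None] * nproc

def demo(pid, nproc):
    y = pid * 10
    if pid % 2 == 1:                # divergent branch, then resynchronize
        y += 1
    yield ("SYNCH", None)
    y = yield ("SHUF", y)           # receives (SHUF Y)[pid]
    return y

def bad(pid, nproc):
    if pid == 0:
        yield ("SYNCH", None)       # processor 0 waits at a SYNCH ...
    else:
        yield ("SHUF", pid)         # ... while the others wait at a SHUF

print(emulate(demo, 4))             # [0, 20, 11, 31]
```

Calling emulate(bad, 4) raises RuntimeError, which corresponds to the deadlock diagnosis mentioned above.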

4.   A  Remark  on  Hardware  Implementation  of  the  SYNCH  Primitive 

The SYNCH primitive can be implemented as follows (for a large but not
enormous collection of processors). Let $I_j$ be the instruction location counter
of the j-th processor, and $\bar{I}_j$ its boolean inverse. Form

$$S = \bigvee_{k=1}^{L} \left[ \left( \bigvee_{j=1}^{NPROC} \mathrm{BIT}(k)\, I_j \right) \wedge \left( \bigvee_{j=1}^{NPROC} \mathrm{BIT}(k)\, \bar{I}_j \right) \right]$$

where L is the length of $I_j$. Then S=0 if and only if the processors are
synchronized. Done efficiently by dot-oring, this should not require much more
than 5-6 levels of logic. Processors executing SYNCH can simply delay until S=0.

It  is  also  easy  to  give  a  deadlock  check:  let  each  processor  executing  SYNCH 
drop  a  line  sy,  and  tie  all  those  lines  together  in  a  dot-or  tree  to  get  a  single  result 
syt  which  will  only  be  zero  if  all  processors  are  attempting  to  synchronize.  If  syt 
is  off  but  S  is  on,  a  deadlock  exists. 
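
The logic of the signal S and of the deadlock check can be modelled in Python as below. This is a software illustration of the boolean formulas only, not of the dot-or circuitry. The function names are hypothetical, bits are numbered from 0 rather than from 1, and a processor that is executing SYNCH is represented by True in `executing_synch`.

```python
def sync_signal(ilcs, length):
    """S is 1 if, for some bit position, the location counters disagree."""
    s = 0
    for k in range(length):
        some_one = any((i >> k) & 1 for i in ilcs)         # OR of BIT(k) I_j
        some_zero = any(not (i >> k) & 1 for i in ilcs)    # OR of BIT(k) of inverse
        s |= int(some_one and some_zero)
    return s

def deadlocked(ilcs, length, executing_synch):
    """Deadlock: every processor waits at a SYNCH, yet the counters differ."""
    syt = int(not all(executing_synch))    # 0 only when all lines are dropped
    return syt == 0 and sync_signal(ilcs, length) == 1

print(sync_signal([5, 5, 5], 4))                        # 0: synchronized
print(sync_signal([5, 4, 5], 4))                        # 1: counters differ
print(deadlocked([12, 20, 12], 8, [True, True, True]))  # True
```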

A  similar  technique  can  be  used  to  measure  the  number  of  cycles  wasted  by 
processors  attempting  to  synchronize. 